Toward Suboptimal Feature Selection
Abstract
Feature selection can be defined as the problem of finding the optimal subset of features that meets some criterion, but in general it is an ill-posed problem. In practice, the only way to compare feature selection algorithms has been through generalization error, and for this reason it is extremely rare to see optimality discussed in the context of feature selection. Koller and Sahami [2] were the first to make optimality claims in the literature. They propose a Markov blanket criterion for eliminating features, an approach that is not unlike learning a graph structure by finding conditional independencies in the data. A set of features M ⊂ X with Xi ∉ M is a Markov blanket for feature Xi if (Xi ⊥ X − M − {Xi} | M) [5]. This is equivalent to saying that the Markov blanket M subsumes all of the information contained in feature Xi, not just with respect to the labels Y but with respect to all other features as well. In this way the Markov blanket can detect both irrelevant and redundant features.

Koller and Sahami contend that the Markov blanket criterion is the optimal solution to the feature selection problem because it removes only features that are unnecessary and, conversely, it removes all unnecessary features. The first part of the optimality claim is extremely persuasive: there is no reason to keep a feature if another set of features subsumes all of the information it contains. The converse, however, is less convincing. To illustrate where this type of feature selection criterion should succeed, consider the artificial problem where you have n binary features and the label is the XOR of the first two features, which are independent of each other. The remaining features are either correlated with the label or are noise. A Markov blanket algorithm should be able to remove not just the noise features but the correlated features as well. Because the first two features are all we need to determine the label precisely, additional features can only decrease the performance of the classifier. In this case the result is optimal in just about every sense.

Unfortunately, in the real world this notion of optimality is not enough. The data is limited and noisy, features are often too plentiful, and rarely is there a deterministic relationship between a feature subset and the labels. In the absence of a clear-cut situation, the trade-off between model complexity and good classification is one that I believe can only be resolved by looking at the generalization error. The idea that there is some objective way to eliminate features independently of both generalization and an induction algorithm is appealing, but it contradicts all of the intuition I have developed while working with feature selection. In this paper I attempt to compare feature selection via Markov blankets to Boosting. I then...
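To make the XOR construction concrete, the following is a minimal sketch (mine, not from the paper) that builds the artificial problem and checks the Markov blanket intuition empirically. It assumes NumPy and scikit-learn are available; the sample size, the 80% agreement rate of the correlated feature, and the choice of classifier are all illustrative, not taken from the paper.

```python
# Sketch of the XOR example: the label is x1 XOR x2, plus one redundant
# (label-correlated) feature and one noise feature.
import numpy as np
from sklearn.metrics import mutual_info_score
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 200

# Two independent binary features; the label is their XOR.
x1 = rng.integers(0, 2, n)
x2 = rng.integers(0, 2, n)
y = x1 ^ x2

# A redundant feature that agrees with the label 80% of the time,
# and a pure-noise feature.
x_corr = np.where(rng.random(n) < 0.8, y, 1 - y)
x_noise = rng.integers(0, 2, n)

# Marginally, x_corr looks informative about y ...
print("I(x_corr; y)          =", mutual_info_score(x_corr, y))

# ... but conditioned on the candidate Markov blanket {x1, x2} it carries no
# extra information: within each (x1, x2) stratum the label is constant,
# so the conditional mutual information is zero.
cmi = 0.0
for a in (0, 1):
    for b in (0, 1):
        mask = (x1 == a) & (x2 == b)
        cmi += mask.mean() * mutual_info_score(x_corr[mask], y[mask])
print("I(x_corr; y | x1, x2) =", cmi)

# A Markov blanket filter would therefore keep only {x1, x2}; with limited
# data the extra features tend to hurt cross-validated accuracy.
clf = DecisionTreeClassifier(random_state=0)
X_full = np.column_stack([x1, x2, x_corr, x_noise])
X_blanket = np.column_stack([x1, x2])
print("CV accuracy, all features :", cross_val_score(clf, X_full, y, cv=5).mean())
print("CV accuracy, {x1, x2} only:", cross_val_score(clf, X_blanket, y, cv=5).mean())
```

The conditional-independence computation is the kind of test a Markov blanket algorithm approximates, while the cross-validation numbers reflect the generalization-error view argued for above.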
Similar Resources
Optimal Feature Selection for Data Classification and Clustering: Techniques and Guidelines
In this paper, principles and existing feature selection methods for classifying and clustering data are introduced. To that end, categorizing frameworks for finding selected subsets, namely search-based and non-search-based procedures, as well as evaluation criteria and data mining tasks are discussed. In the following, a platform is developed as an intermediate step toward developing an intell...
Privacy Aware Feature Selection: An Application to Protecting Motion Data
Advances in machine learning provide the ability to predict personal data from seemingly unrelated sources. We focus on privacy leaks from providing motion data from a smartphone and seek to understand the risk to personal privacy. We collect a data set containing 74 statistical features from various motion sensors continuously collected from 88 subjects providing over 40 hours of data. An idea...
Small Sample Feature Selection: A Dissertation by Chao Sima
Small Sample Feature Selection. (May 2006). Chao Sima, B.Eng., Xi’an Jiaotong University. Chair of Advisory Committee: Dr. Edward R. Dougherty. High-throughput technologies for rapid measurement of vast numbers of biological variables offer the potential for highly discriminatory diagnosis and prognosis; however, high dimensionality together with small samples creates the need for feature selectio...
Impact of error estimation on feature selection
Given a large set of potential features, it is usually necessary to find a small subset with which to classify. The task of finding an optimal feature set is inherently combinatoric and therefore suboptimal algorithms are typically used to find feature sets. If feature selection is based directly on classification error, then a feature-selection algorithm must base its decision on error estimat...
Explicit Max Margin Input Feature Selection for Nonlinear SVM using Second Order Methods
Incorporating feature selection in nonlinear SVMs leads to a large and challenging nonconvex minimization problem, which can be prone to suboptimal solutions. We use a second order optimization method that utilizes eigenvalue information and is less likely to get stuck at suboptimal solutions. We devise an alternating optimization approach to tackle the problem efficiently, breaking it down int...